Overview

Our project used data from Kaggle’s 2013 Yelp Challenge. This challenge included a subset of Yelp data from the metropolitan area of Phoenix, Arizona. Our data includes user reviews, ratings, and check-in data for a wide range of businesses.

Data Acquisition & Transformations

Data was acquired and transformed in the preprocessing.R file located within our repository’s final-project folder. Our data source was provided as JSON files in which each file is a collection of JSON records, one per line. We used the stream_in function, which parses JSON data line by line, to read the files from the data folder of our repository. The collections included three large data sets covering Yelp businesses, users, and reviews.
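As a rough illustration of the line-by-line parsing described above, the sketch below feeds two made-up review records to jsonlite's stream_in through a text connection. The record contents and IDs are hypothetical, and the use of flatten to expand the nested votes object is an assumption about how columns such as votes.funny were produced, not a copy of preprocessing.R.

```r
library(jsonlite)

# Two hypothetical review records, one JSON object per line.
ndjson <- c(
  '{"review_id": "r1", "stars": 4, "votes": {"funny": 0, "useful": 1, "cool": 0}}',
  '{"review_id": "r2", "stars": 3, "votes": {"funny": 2, "useful": 4, "cool": 3}}'
)

# stream_in() reads a connection line by line and returns a data frame
# with one row per record; nested objects become nested data frames.
raw <- stream_in(textConnection(ndjson), verbose = FALSE)

# flatten() expands the nested "votes" data frame into
# votes.funny, votes.useful, and votes.cool columns.
reviews <- flatten(raw)
```

For a file on disk, the connection would be created with file() on the path inside the repository's data folder instead of textConnection().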

Once obtained, we prepared our data for our recommender system using the following transformations:

Business

We chose to limit the scope of our recommender system to businesses with tags related to food and beverages. There were originally 508 unique category tags listed within our business data. We manually selected 112 target categories and used them to subset our data.

We applied additional transformations to remove unnecessary data. There were 1,224 businesses in our data that were permanently closed. These businesses accounted for 9.8% of all businesses and were removed from our data. We also removed 3 businesses in our dataset that were located outside of Arizona.
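The filters described above can be sketched in base R as follows. The sample rows, the column names open and state, and the three-item target list are all assumptions made for illustration; the real target list holds the 112 manually chosen categories.

```r
# Hypothetical business records; column names and values are illustrative.
business <- data.frame(
  business_id = c("b1", "b2", "b3", "b4"),
  categories  = c("Food, Grocery", "Pizza, Restaurants",
                  "Auto Repair", "Burgers, Restaurants"),
  state       = c("AZ", "AZ", "AZ", "CA"),
  open        = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Stand-in for the manually selected list of target categories.
target_categories <- c("Food", "Restaurants", "Pizza")

# TRUE when at least one of a business's tags is in the target list.
has_target <- vapply(
  strsplit(business$categories, ",\\s*"),
  function(tags) any(tags %in% target_categories),
  logical(1)
)

# Keep food and beverage businesses that are still open and located in AZ.
food_business <- business[has_target & business$open & business$state == "AZ", ]
```

In this toy data only b1 survives: b2 is closed, b3 has no target tag, and b4 is outside Arizona.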

As a result of our transformations, our recommender data was reduced to 4,828 unique businesses. This was further limited to 4,332 after randomly sampling our user data. The output can be previewed below:

Preview Business Data
business_id categories city name longitude state latitude
usAsSV36QmUej8--yvN-dg Food, Grocery Phoenix Food City -112.0854 AZ 33.39221
PzOqRohWw7F7YEPBz6AubA Food, Bagels, Delis, Restaurants Glendale Az Hot Bagels & Deli -112.2003 AZ 33.71280
qarobAbxGSHI7ygf1f7a_Q Sandwiches, Restaurants Gilbert Jersey Mike’s Subs -111.8120 AZ 33.37884
JxVGJ9Nly2FFIs_WpJvkug Pizza, Restaurants Scottsdale Sauce -111.9263 AZ 33.61746
Jj7bcQ6NDfKoz4TXwvYfMg Burgers, Restaurants Phoenix Fuddruckers -112.1162 AZ 33.56699
JHp5mJvYe6UtM_QsklR-iw Pizza, Restaurants Scottsdale Peter Piper Pizza -111.9175 AZ 33.46613

Review

We restricted our review data to reviews of the selected food and beverage businesses. This reduced our review data from 229,907 to 165,823 reviews. We later applied another filter to keep only reviews from 10,000 randomly sampled users, which further decreased the reviews to 44,494 observations. Our review data can be previewed in two parts below:

Preview Review Data (without Review Text)
votes.funny votes.useful votes.cool user_id review_id stars date business_id
4 7 7 wFweIWhv2fREZV_dYkz_1g riFQ3vxNpP4rWLk_CSri2A 5 2010-02-12 zp713qNhx8d9KCJJnrw1xA
2 4 3 SBbftLzfYYKItOMFwOTIJg HXP_0Ul-FCmA4f-k9CqvaQ 3 2008-10-12 supigcPNO9IKo6olaTNV-g
1 4 2 C6IOtaaYdLIT5fWd7ZYIuA MuqugTuR5DdIPcZ2IVP3aQ 3 2008-10-08 8FNO4D3eozpIjj0k3q5Zbg
1 4 2 RRTraCQw77EU4yZh0BBTag B5h25WK28rJjx4KHm4gr7g 4 2008-03-21 wct7rZKyZqZftzmAU-vhWQ
0 1 0 kpbhy1zPewGDmdNfNqQp-g hre97jjSwon4bn1muHKOJg 4 2012-07-12 i213sY5rhkfCO8cD-FPr1A
0 1 0 8AMn6644NmBf96xGO3w6OA S9OVpXat8k5YwWCn6FAgXg 1 2012-05-04 vvA3fbps4F9nGlAEYKk_sA
Preview of a Single Review Text
text

Drop what you’re doing and drive here. After I ate here I had to go back the next day for more. The food is that good.

This cute little green building may have gone competely unoticed if I hadn’t been driving down Palm Rd to avoid construction. While waiting to turn onto 16th Street the “Grand Opening” sign caught my eye and my little yelping soul leaped for joy! A new place to try!

It looked desolate from the outside but when I opened the door I was put at easy by the decor, smell and cleanliness inside. I ordered dinner for two, to go. The menu was awesome. I loved seeing all the variety: poblano peppers, mole, mahi mahi, mushrooms…something wrapped in banana leaves. It made it difficult to choose something. Here’s what I’ve had so far: La Condesa Shrimp Burro and Baja Sur Dogfish Shark Taco. They are both were very delicious meals but the shrimp burro stole the show. So much flavor. I snagged some bites from my hubbys mole and mahi mahi burros- mmmm such a delight. The salsa bar is endless. I really stocked up. I was excited to try the strawberry salsa but it was too hot, in fact it all was, but I’m a big wimp when it comes to hot peppers. The horchata is handmade and delicious. They throw pecans and some fruit in there too which is a yummy bonus!

As if the good food wasn’t enough to win me over the art in this restaurant sho did! I’m a sucker for Mexican folk art and Frida Kahlo is my Oprah. There’s a painting of her and Diego hanging over the salsa bar, it’s amazing. All the paintings are great, love the artist.

User

Last, we applied a similar filter to users, keeping only those who reviewed our selected businesses. This decreased our user data from 43,873 to 35,268 distinct user_id observations. Due to processing constraints in R, we chose to randomly sample 10,000 users from these unique profiles.
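A minimal sketch of the user sampling step, assuming the sample is drawn from the distinct user IDs and then used to filter the reviews. The toy table, the seed, and the sample size of 10 (standing in for 10,000) are illustrative choices only.

```r
set.seed(42)  # arbitrary seed so the sample is reproducible

# Hypothetical review table: 50 users with 3 reviews each.
reviews <- data.frame(
  user_id = rep(sprintf("u%02d", 1:50), each = 3),
  stars   = rep(c(3, 4, 5), times = 50),
  stringsAsFactors = FALSE
)

n_sample <- 10  # the report sampled 10,000 users

# Sample from the distinct user IDs, then keep every review by those users.
sampled_users   <- sample(unique(reviews$user_id), n_sample)
sampled_reviews <- reviews[reviews$user_id %in% sampled_users, ]
```

Sampling users rather than individual reviews keeps each selected user's full rating history intact.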

The dataframe preview below shows aggregate user data for all reviews an individual user provided on Yelp within our data selection.

Preview User Data
user_id user_name review_count votes.funny votes.useful votes.cool average_stars
--lMCM6K8-9NTvPlbCMXEA Anne Marie 1 0 0 0 4.0
--LzFD0UDbYE-Oho3AhsOg Shumai 1 0 1 0 4.0
--M-cIkGnH1KhnLaCOmoPQ Emma 1 2 2 2 5.0
-01H9S7YxFrhRgNdvxmaVQ Marc 1 0 0 0 5.0
-06LYbA4Qm_9E83KNT1Jrg Brett 2 0 0 0 4.5
-0Ycl6yN0BsX1U70-SZOYw Kate 2 0 0 0 4.0

Merge Data

Next, we created our main dataframe by merging the business and review data on business_id. This dataframe serves as the source of data for our recommender algorithms. The user and business unique keys were simplified from character strings to numeric user/item identifiers (userID and itemID).
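The merge and the key simplification could look roughly like the base R sketch below. The toy rows are hypothetical, and numbering keys in order of first appearance is an assumption about how userID and itemID were assigned.

```r
# Hypothetical business and review tables sharing a business_id key.
business <- data.frame(
  business_id = c("bA", "bB", "bC"),
  name        = c("Food City", "Sauce", "Fuddruckers"),
  stringsAsFactors = FALSE
)
reviews <- data.frame(
  business_id = c("bA", "bB", "bB", "bC"),
  user_id     = c("u9", "u3", "u9", "u5"),
  stars       = c(3, 4, 2, 5),
  stringsAsFactors = FALSE
)

# Inner join on the shared business_id column.
main <- merge(business, reviews, by = "business_id")

# Replace the character keys with integers numbered by first appearance.
main$userID <- match(main$user_id, unique(main$user_id))
main$itemID <- match(main$business_id, unique(main$business_id))
```

Compact integer identifiers are convenient later because they can index the rows and columns of a rating matrix directly.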

We return to this dataframe later when building our recommender matrices and algorithms. Review details were omitted in the preview for brevity.

Preview main dataframe
business_id categories city name longitude state latitude votes.funny votes.useful votes.cool user_id review_id stars date userID itemID
usAsSV36QmUej8--yvN-dg Food, Grocery Phoenix Food City -112.0854 AZ 33.39221 0 0 0 1Eevry0X_8yb6yzsQilptg F-R4pX3Ane7y3VlswhWrrQ 3 2011-11-20 1 1
PzOqRohWw7F7YEPBz6AubA Food, Bagels, Delis, Restaurants Glendale Az Hot Bagels & Deli -112.2003 AZ 33.71280 0 1 0 Iycf9KNRhxvR187Qu2zZHg hg7rapz_KzAqhoOFYhXVoQ 4 2012-06-11 2 2
qarobAbxGSHI7ygf1f7a_Q Sandwiches, Restaurants Gilbert Jersey Mike’s Subs -111.8120 AZ 33.37884 1 0 0 4UypETvlv8cl0jKFxHh3Zw OhWvwGTbiuT4tnLpK-iC-w 2 2012-08-27 3 3
qarobAbxGSHI7ygf1f7a_Q Sandwiches, Restaurants Gilbert Jersey Mike’s Subs -111.8120 AZ 33.37884 0 1 0 5j7qmDZTAetaH0yXFnAFyw rTghOy2OZxdmI6ofRzI0Bg 3 2012-03-09 4 3
qarobAbxGSHI7ygf1f7a_Q Sandwiches, Restaurants Gilbert Jersey Mike’s Subs -111.8120 AZ 33.37884 1 2 1 uNbB1uR4EBhmygUc3IfPAw EY-eYBoXIjn2k2X_ZDTpJA 4 2012-05-10 5 3
JxVGJ9Nly2FFIs_WpJvkug Pizza, Restaurants Scottsdale Sauce -111.9263 AZ 33.61746 0 0 0 l_6XDatGLHfkGxl8BjI2Ag imbU3ZZlDf5SIKHkaEskaw 5 2011-09-22 6 4

Visualize Data

Add data visualizations.

Recommender Algorithm

We tested three recommender algorithms to see which had the best performance metrics for our recommender system. To test the algorithms, we first had to create a user-item matrix and then split our data into training and test sets.

Matrix Building

We converted our raw ratings data into a user-item matrix for training and testing our recommender algorithms. Using the recommenderlab package, the matrix was saved as a realRatingMatrix for later processing.
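To make the shape of the user-item matrix concrete, here is a small base R sketch that places star ratings into a users-by-items matrix, with NA for unrated pairs. The five ratings are invented for illustration.

```r
# Hypothetical ratings keyed by numeric user and item identifiers.
ratings <- data.frame(
  userID = c(1, 1, 2, 3, 3),
  itemID = c(1, 2, 2, 1, 3),
  stars  = c(5, 3, 4, 2, 5)
)

# Rows are users, columns are items; NA marks a missing rating.
m <- matrix(
  NA_real_,
  nrow = max(ratings$userID),
  ncol = max(ratings$itemID),
  dimnames = list(paste0("u", 1:3), paste0("i", 1:3))
)

# A two-column index matrix assigns each rating to its (user, item) cell.
m[cbind(ratings$userID, ratings$itemID)] <- ratings$stars

# With recommenderlab loaded, as(m, "realRatingMatrix") coerces a matrix
# like this one into the package's sparse rating class.
```

Most cells stay NA in real data, which is why a sparse class such as realRatingMatrix is preferable to a dense matrix at full scale.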

The matrix data can be viewed below.

Train and Test Splits

Our data was split into training and test sets for model evaluation of each recommender algorithm. We split our data using 10 folds (k = 10) with the recommenderlab package. 80% of the data was retained for training and 20% for testing.
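One reading of the description above is ten repeated random 80/20 splits of the users. The base R sketch below assumes that reading and does not reproduce the recommenderlab call used in the project; the user count and the seed are arbitrary.

```r
set.seed(1)  # arbitrary seed for reproducibility

n_users <- 100   # hypothetical number of users
k       <- 10    # number of repetitions
train_p <- 0.8   # share of users kept for training

# Each element holds the training and test user indices for one repetition.
splits <- lapply(seq_len(k), function(i) {
  train <- sample(n_users, size = round(train_p * n_users))
  list(train = train, test = setdiff(seq_len(n_users), train))
})
```

Averaging an error metric over the k repetitions gives a steadier estimate than a single split would.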

Algorithm 1 (Raj)

The user and business ratings were adjusted so that 0 indicates no feedback, -1 indicates negative feedback, and 1 indicates positive feedback.

I decided to use Jaccard distance to measure the similarity between business profiles.
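A minimal sketch of Jaccard distance between two business profiles follows. The report does not say how negative feedback enters the calculation, so this version is an assumption: it compares only the sets of users who gave positive feedback, and the feedback matrix is invented.

```r
# Hypothetical feedback matrix: rows are businesses, columns are users.
# 1 = positive, -1 = negative, 0 = no feedback.
feedback <- rbind(
  biz_a = c( 1, 1, 0, -1, 1),
  biz_b = c( 1, 0, 0, -1, 1),
  biz_c = c(-1, 0, 1,  0, 0)
)

# Jaccard similarity of the sets of users with positive feedback is
# |A intersect B| / |A union B|; Jaccard distance is 1 minus that value.
jaccard_distance <- function(x, y) {
  a <- x == 1
  b <- y == 1
  union_size <- sum(a | b)
  if (union_size == 0) return(NA_real_)  # undefined when both sets are empty
  1 - sum(a & b) / union_size
}

d_ab <- jaccard_distance(feedback["biz_a", ], feedback["biz_b", ])
d_ac <- jaccard_distance(feedback["biz_a", ], feedback["biz_c", ])
```

Here biz_a and biz_b share two of three positive users, giving a distance of 1/3, while biz_a and biz_c share none, giving a distance of 1.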

Algorithm 2 (Christina)

Algo

Algorithm 3 (Juliann)

Set up:

[1] '2.1.0'

ALS predictions

Prediction output.
Preview of prediction output
user_id item stars prediction
fczQCSmaWF78toLEmb0Zsw 173 - Lalibela Ethiopian Cafe 4 3.6
fczQCSmaWF78toLEmb0Zsw 2696 - Sakana Sushi & Grill 4 4.1
fczQCSmaWF78toLEmb0Zsw 3203 - Renegade Tap & Kitchen 3 4.6
fczQCSmaWF78toLEmb0Zsw 2120 - Sala Thai 4 3.5
fczQCSmaWF78toLEmb0Zsw 3951 - Benihana 3 3.1
fczQCSmaWF78toLEmb0Zsw 3059 - Julia Baker Confections 5 4.8

Evaluate performance metrics.

              [,1]
mseSpark  1.924279
rmseSpark 1.387184
maeSpark  1.091038
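For reference, the three metrics can be computed in base R as sketched below. The inputs are only the six rows from the prediction preview above, so the resulting values differ from the full-data figures reported; the variable names are illustrative.

```r
# Actual stars and predictions from the six preview rows.
stars      <- c(4, 4, 3, 4, 3, 5)
prediction <- c(3.6, 4.1, 4.6, 3.5, 3.1, 4.8)

err  <- stars - prediction
mse  <- mean(err^2)     # mean squared error
rmse <- sqrt(mse)       # root mean squared error
mae  <- mean(abs(err))  # mean absolute error
```

RMSE is the square root of MSE, which is consistent with the reported pair (1.387184 squared is approximately 1.924279).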

Analysis

Compare the algorithms’ performance. Select the most effective one to build the recommender system.

Recommender System

Test system

Conclusion

Final conclusion. Explain limitations of the system. Make recommendations for future improvements.

References